Google Play apps

Data manipulation and cleaning

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
In [2]:
df=pd.read_csv("googleplaystore.csv")
In [3]:
df.head()
Out[3]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
0 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN 4.1 159 19M 10,000+ Free 0 Everyone Art & Design January 7, 2018 1.0.0 4.0.3 and up
1 Coloring book moana ART_AND_DESIGN 3.9 967 14M 500,000+ Free 0 Everyone Art & Design;Pretend Play January 15, 2018 2.0.0 4.0.3 and up
2 U Launcher Lite – FREE Live Cool Themes, Hide ... ART_AND_DESIGN 4.7 87510 8.7M 5,000,000+ Free 0 Everyone Art & Design August 1, 2018 1.2.4 4.0.3 and up
3 Sketch - Draw & Paint ART_AND_DESIGN 4.5 215644 25M 50,000,000+ Free 0 Teen Art & Design June 8, 2018 Varies with device 4.2 and up
4 Pixel Draw - Number Art Coloring Book ART_AND_DESIGN 4.3 967 2.8M 100,000+ Free 0 Everyone Art & Design;Creativity June 20, 2018 1.1 4.4 and up
In [4]:
print("The number of duplicate rows we have is:", len(df) - len(df.drop_duplicates()))
df.drop_duplicates(inplace=True)
The number of duplicate rows we have is: 483
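A minimal sketch of the same count on a toy frame, assuming the goal is to count fully identical rows. `DataFrame.duplicated()` flags every repeat after the first occurrence, so its sum should equal the difference in lengths computed in the cell above. The `toy` frame is hypothetical illustration data, not part of the dataset.

```python
import pandas as pd

# Hypothetical toy data: the second and third rows are exact copies of the first.
toy = pd.DataFrame({"App": ["A", "A", "A", "B"], "Rating": [4.1, 4.1, 4.1, 3.9]})

# duplicated() marks repeats after the first occurrence of each row.
n_dupes = int(toy.duplicated().sum())

# Same number obtained the way the notebook does it.
same_count = len(toy) - len(toy.drop_duplicates())

print(n_dupes, same_count)
```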
In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10358 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10358 non-null  object 
 1   Category        10358 non-null  object 
 2   Rating          8893 non-null   float64
 3   Reviews         10358 non-null  object 
 4   Size            10358 non-null  object 
 5   Installs        10358 non-null  object 
 6   Type            10357 non-null  object 
 7   Price           10358 non-null  object 
 8   Content Rating  10357 non-null  object 
 9   Genres          10358 non-null  object 
 10  Last Updated    10358 non-null  object 
 11  Current Ver     10350 non-null  object 
 12  Android Ver     10355 non-null  object 
dtypes: float64(1), object(12)
memory usage: 1.1+ MB
In [6]:
print('The columns with null values are: Rating - Type - Content Rating - Current Ver - Android Ver.')
print("We can see a lot of null values in the Rating column:")
df.isnull().sum()
The columns with null values are: Rating - Type - Content Rating - Current Ver - Android Ver.
We can see a lot of null values in the Rating column:
Out[6]:
App                  0
Category             0
Rating            1465
Reviews              0
Size                 0
Installs             0
Type                 1
Price                0
Content Rating       1
Genres               0
Last Updated         0
Current Ver          8
Android Ver          3
dtype: int64
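To put the null counts in proportion, the share of missing values per column can be computed as well. This is a sketch on hypothetical toy data; `isnull().mean()` gives the fraction of missing entries because `True` counts as 1.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in two columns.
toy = pd.DataFrame({
    "App": ["a", "b", "c", "d"],
    "Rating": [4.1, np.nan, 3.9, np.nan],
    "Type": ["Free", "Paid", None, "Free"],
})

# Percentage of missing values per column, largest first.
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```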

========================================================================================================================

The info table we ran earlier shows that the Price and Installs columns are stored as strings (object dtype), not numbers. We need to convert them so we can do arithmetic on them.

In [7]:
df[["Price","Installs"]].head()
Out[7]:
Price Installs
0 0 10,000+
1 0 500,000+
2 0 5,000,000+
3 0 50,000,000+
4 0 100,000+
In [8]:
df[df['Price']!="0"][["Price","Installs"]].head()
Out[8]:
Price Installs
234 $4.99 100,000+
235 $4.99 100,000+
427 $3.99 100,000+
476 $3.99 10,000+
477 $6.99 1,000+

Now we notice three characters that need to be removed: the dollar sign ($), the comma (,) and the plus sign (+).

In [9]:
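# Strip the currency sign, the thousands separators and the trailing plus sign.
# regex=False treats "$" and "+" as literal characters, not regex metacharacters.
for i in ["Price","Installs"]:
    for z in ["$",",","+"]:
        df[i]=df[i].str.replace(z,"",regex=False)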
In [10]:
df[df['Price']!="0"][["Price","Installs"]].head()
Out[10]:
Price Installs
234 4.99 100000
235 4.99 100000
427 3.99 100000
476 3.99 10000
477 6.99 1000
In [11]:
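# At least one row appears to have its values shifted by one column, leaving
# "Everyone" in Price and "Free" in Installs; blank them so they convert to NaN.
df["Price"]=df["Price"].str.replace("Everyone","")
df["Installs"]=df["Installs"].str.replace("Free","")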
In [12]:
df["Price"]=pd.to_numeric(df["Price"])
df["Installs"]=pd.to_numeric(df["Installs"])
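The cleaning steps above can also be sketched as one helper. This is an illustration under assumptions, not the notebook's own method: `to_number` is a hypothetical name, and `errors="coerce"` is used so that any leftover text such as "Everyone" or "Free" becomes NaN without a dedicated replace step.

```python
import pandas as pd

def to_number(col: pd.Series, chars=("$", ",", "+")) -> pd.Series:
    """Strip literal characters, then convert; unparseable values become NaN."""
    for ch in chars:
        col = col.str.replace(ch, "", regex=False)  # literal match, not a regex
    return pd.to_numeric(col, errors="coerce")

# Hypothetical toy data, including one row with misplaced text values.
toy = pd.DataFrame({
    "Price": ["0", "$4.99", "Everyone"],
    "Installs": ["10,000+", "5,000,000+", "Free"],
})

price = to_number(toy["Price"])
installs = to_number(toy["Installs"])
print(price.tolist(), installs.tolist())
```

Coercing silently hides bad rows, so it is worth checking how many NaN values the conversion introduced before relying on it.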

========================================================================================================================

Examine the app category share in the platform based on the number of installs.

In [13]:
df.head(2)
Out[13]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
0 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN 4.1 159 19M 10000.0 Free 0.0 Everyone Art & Design January 7, 2018 1.0.0 4.0.3 and up
1 Coloring book moana ART_AND_DESIGN 3.9 967 14M 500000.0 Free 0.0 Everyone Art & Design;Pretend Play January 15, 2018 2.0.0 4.0.3 and up
In [14]:
fig = px.sunburst(df, path=['Category', 'Genres'], values='Installs',title='Hierarchy of apps')
fig.show()
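The sunburst chart sizes each segment by its total installs. As a numeric counterpart, the share per category can be computed directly; the sketch below uses hypothetical toy data rather than the real frame.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "Category": ["GAME", "GAME", "TOOLS", "FAMILY"],
    "Installs": [600, 200, 150, 50],
})

total = toy["Installs"].sum()

# Percentage of all installs that falls into each category, largest first.
share = (
    toy.groupby("Category")["Installs"].sum()
       .mul(100)
       .div(total)
       .sort_values(ascending=False)
)
print(share)
```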

Extend the per-category view to show, for each category, the number of apps, the mean rating, and the total installs, size and number of reviews, sorted by the number of apps.

In [15]:
# regex=False so that "." is matched literally (e.g. "3.0M" becomes "3000000")
df['Reviews']=pd.to_numeric(df['Reviews'].str.replace(".0M","000000",regex=False))
In [16]:
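def size_to_mb(size):
    """Convert a Size string such as '19M' or '201k' to megabytes."""
    size = size.replace(",", "").replace("+", "")
    if size.endswith("M"):
        return float(size[:-1])
    if size.endswith("k"):
        return float(size[:-1]) * 10**-3   # kilobytes to megabytes
    # "Varies with device" and any other non-numeric text become NaN
    return pd.to_numeric(size, errors="coerce")

# map() keeps the row index, so values stay aligned after drop_duplicates()
df['Size'] = df['Size'].map(size_to_mb)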
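A point worth keeping in mind when replacing a column after `drop_duplicates`: pandas aligns an assigned Series on index labels, not on position, and `df` now has gaps in its index (10358 entries with labels from 0 to 10840). A minimal sketch on hypothetical toy data shows the difference between a Series with a fresh index and one that reuses the frame's index.

```python
import pandas as pd

# Hypothetical toy frame whose index has gaps, as after drop_duplicates().
toy = pd.DataFrame({"Size": ["19M", "14M", "8.7M"]}, index=[0, 2, 5])

values = [float(s.replace("M", "")) for s in toy["Size"]]

# A Series built from a list gets labels 0, 1, 2 and is matched by label.
misaligned = toy.assign(Size_MB=pd.Series(values))

# Reusing the frame's own index keeps every value on its row.
aligned = toy.assign(Size_MB=pd.Series(values, index=toy.index))

print(misaligned)
print(aligned)
```

In the misaligned frame the row labelled 2 receives the third value and the row labelled 5 receives NaN, which is easy to miss because the first rows still look right.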
In [17]:
f_cat_count = (
    df.groupby('Category')
      .agg({"App":"count","Rating":"mean","Reviews":"sum","Size":"sum","Installs":"sum"})
      .reset_index()
      .sort_values("App",ascending=False)
      .reset_index(drop=True)
)
In [18]:
f_cat_count.head()
Out[18]:
   Category   App    Rating     Reviews     Size      Installs
0    FAMILY  1943  4.191153   396771969  31815.6  1.004169e+10
1      GAME  1121  4.281285  1415536650  25593.2  3.154402e+10
2     TOOLS   843  4.047411   273185044  12521.5  1.145277e+10
3  BUSINESS   427  4.102593    12358171   8148.4  8.636649e+08
4   MEDICAL   408  4.182450     1396757   6329.7  4.220418e+07
In [19]:
f_cat_count['Reviews']=round(f_cat_count['Reviews']/10**6,2)
f_cat_count.rename(columns = {'Reviews':'Reviews_in_M'}, inplace = True) 
In [20]:
f_cat_count['Installs']=round(f_cat_count['Installs']/10**9,2)
f_cat_count.rename(columns = {'Installs':'Installs_in_B'}, inplace = True) 
In [21]:
f_cat_count['Size']=round(f_cat_count['Size']/10**3,1)
f_cat_count.rename(columns = {'Size':'Size_in_G'}, inplace = True) 
In [22]:
f_cat_count['Rating']=round(f_cat_count['Rating'],2)
In [23]:
f_cat_count.index=f_cat_count.index+1
f_cat_count.head()
Out[23]:
   Category   App  Rating  Reviews_in_M  Size_in_G  Installs_in_B
1    FAMILY  1943    4.19        396.77       31.8          10.04
2      GAME  1121    4.28       1415.54       25.6          31.54
3     TOOLS   843    4.05        273.19       12.5          11.45
4  BUSINESS   427    4.10         12.36        8.1           0.86
5   MEDICAL   408    4.18          1.40        6.3           0.04
In [24]:
fig = px.bar(f_cat_count, x='Category', y='App',hover_data=['Rating', 'Reviews_in_M',"Size_in_G","Installs_in_B"],
            title='Apps count of each category')
fig.update_layout(xaxis_tickangle=-45)
fig.show()

=========================================================================

Make heat maps to see how an app's rating relates to its size and to its price.

In [25]:
df.head(2)
Out[25]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
0 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN 4.1 159 19.0 10000.0 Free 0.0 Everyone Art & Design January 7, 2018 1.0.0 4.0.3 and up
1 Coloring book moana ART_AND_DESIGN 3.9 967 14.0 500000.0 Free 0.0 Everyone Art & Design;Pretend Play January 15, 2018 2.0.0 4.0.3 and up
In [26]:
df2=df.dropna()
fig = px.density_heatmap(df2, x="Size", y="Rating", marginal_x="histogram", marginal_y="histogram",
                         title='The relation between the size and the rating score of apps')
fig.show()
In [27]:
fig = px.density_heatmap(df2, x="Price", y="Rating", marginal_x="histogram", marginal_y="histogram",
                        title='The relation between the price and the rating score of apps')
fig.show()
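The heat maps give a visual impression of the relationships. A correlation matrix is one possible numeric complement; the sketch below assumes Pearson correlation is an acceptable summary and runs on hypothetical toy data, so the values say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical toy data with perfectly linear relationships.
toy = pd.DataFrame({
    "Rating": [4.0, 4.2, 4.4, 4.6],
    "Size": [10.0, 20.0, 30.0, 40.0],
    "Price": [3.0, 2.0, 1.0, 0.0],
})

# Pairwise Pearson correlation between the three numeric columns.
corr = toy[["Rating", "Size", "Price"]].corr()
print(corr)
```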
In [28]:
df2.head()
Out[28]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
0 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN 4.1 159 19.0 10000.0 Free 0.0 Everyone Art & Design January 7, 2018 1.0.0 4.0.3 and up
1 Coloring book moana ART_AND_DESIGN 3.9 967 14.0 500000.0 Free 0.0 Everyone Art & Design;Pretend Play January 15, 2018 2.0.0 4.0.3 and up
2 U Launcher Lite – FREE Live Cool Themes, Hide ... ART_AND_DESIGN 4.7 87510 8.7 5000000.0 Free 0.0 Everyone Art & Design August 1, 2018 1.2.4 4.0.3 and up
3 Sketch - Draw & Paint ART_AND_DESIGN 4.5 215644 25.0 50000000.0 Free 0.0 Teen Art & Design June 8, 2018 Varies with device 4.2 and up
4 Pixel Draw - Number Art Coloring Book ART_AND_DESIGN 4.3 967 2.8 100000.0 Free 0.0 Everyone Art & Design;Creativity June 20, 2018 1.1 4.4 and up

Look at the price distribution of apps in the top 15 categories, with and without outliers.

In [29]:
px.strip(df2[df2['Category'].isin(list(f_cat_count['Category'][:15]))],
         category_orders={"Category":list(f_cat_count['Category'][:15])},
         x="Price", y="Category",color="Type",hover_data=['App',"Rating"],
        title='The distribution of the price for each app category')
In [30]:
px.strip(df2[(df2['Category'].isin(list(f_cat_count['Category'][:15]))) & (df2['Price']<=100)],
         category_orders={"Category":list(f_cat_count['Category'][:15])},
         x="Price", y="Category",color="Type",hover_data=['App',"Rating"],
        title='The distribution of the price for each app category without outliers')
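The cell above removes outliers with a fixed cutoff of 100. A data-driven alternative, swapped in here purely for illustration, is Tukey's rule, which drops values above the third quartile plus 1.5 times the interquartile range. The prices below are hypothetical.

```python
import pandas as pd

# Hypothetical prices with one extreme value.
prices = pd.Series([0.0, 0.99, 1.99, 2.99, 4.99, 399.99])

# Quartiles and the upper fence of Tukey's rule.
q1, q3 = prices.quantile([0.25, 0.75])
upper = q3 + 1.5 * (q3 - q1)

# Keep only the prices at or below the fence.
kept = prices[prices <= upper]
print(upper, kept.tolist())
```

With mostly free apps the quartiles can both be zero, in which case this rule would flag every paid app, so a fixed cutoff may be the more practical choice for this data.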

Distribution of the number of installs for free and paid apps.

In [31]:
px.box(df2,y='Installs',color='Type',log_y=True,hover_data=["App"],
       title='Distribution of app installs by type (logarithmic y-axis)')
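The medians drawn in the box plot can be read off numerically with a grouped aggregation. A sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "Type": ["Free", "Free", "Free", "Paid", "Paid"],
    "Installs": [1000, 100000, 10000000, 100, 10000],
})

# Number of apps and median installs for each app type.
summary = toy.groupby("Type")["Installs"].agg(["count", "median"])
print(summary)
```

The median is used instead of the mean because install counts span several orders of magnitude.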

Add the review data and examine review polarity (positive or negative) based on the app's content rating (kids, teen, adult, everyone) and on the type of the app (free or paid).

In [32]:
reviews_df = pd.read_csv("googleplaystore_user_reviews.csv")
In [33]:
df3=pd.merge(df2, reviews_df, on = 'App', how = "inner").dropna()
df3.head()
Out[33]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver Translated_Review Sentiment Sentiment_Polarity Sentiment_Subjectivity
0 Coloring book moana ART_AND_DESIGN 3.9 967 14.0 500000.0 Free 0.0 Everyone Art & Design;Pretend Play January 15, 2018 2.0.0 4.0.3 and up A kid's excessive ads. The types ads allowed a... Negative -0.250 1.000000
1 Coloring book moana ART_AND_DESIGN 3.9 967 14.0 500000.0 Free 0.0 Everyone Art & Design;Pretend Play January 15, 2018 2.0.0 4.0.3 and up It bad >:( Negative -0.725 0.833333
2 Coloring book moana ART_AND_DESIGN 3.9 967 14.0 500000.0 Free 0.0 Everyone Art & Design;Pretend Play January 15, 2018 2.0.0 4.0.3 and up like Neutral 0.000 0.000000
4 Coloring book moana ART_AND_DESIGN 3.9 967 14.0 500000.0 Free 0.0 Everyone Art & Design;Pretend Play January 15, 2018 2.0.0 4.0.3 and up I love colors inspyering Positive 0.500 0.600000
5 Coloring book moana ART_AND_DESIGN 3.9 967 14.0 500000.0 Free 0.0 Everyone Art & Design;Pretend Play January 15, 2018 2.0.0 4.0.3 and up I hate Negative -0.800 0.900000
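The inner merge keeps only apps that appear in both frames and repeats each app's row once per review; the following `dropna()` then removes reviews with missing fields, which is why the index above skips a label. A sketch on hypothetical toy frames (`apps` and `reviews` are illustration names):

```python
import pandas as pd

# Hypothetical toy frames.
apps = pd.DataFrame({"App": ["A", "B", "C"], "Rating": [4.1, 3.9, 4.5]})
reviews = pd.DataFrame({
    "App": ["A", "A", "C", "D"],
    "Sentiment": ["Positive", "Negative", None, "Neutral"],
})

# Inner join: B has no reviews and D is not in apps, so both drop out.
merged = pd.merge(apps, reviews, on="App", how="inner")

# Rows with a missing value (here the review of C) are removed afterwards.
cleaned = merged.dropna()
print(merged)
print(cleaned)
```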
In [34]:
px.box(df3,y='Sentiment_Polarity',color='Type',hover_data=["App","Translated_Review"],
      title='The distribution of review sentiment polarity of apps')
In [35]:
df3.describe(include='all')
Out[35]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver Translated_Review Sentiment Sentiment_Polarity Sentiment_Subjectivity
count 47364 47364 47364.000000 4.736400e+04 47364.000000 4.736400e+04 47364 47364.000000 47364 47364 47364 47364 47364 47364 47364 47364.000000 47364.000000
unique 650 33 NaN NaN NaN NaN 2 NaN 5 64 207 416 20 21608 3 NaN NaN
top Angry Birds Classic GAME NaN NaN NaN NaN Free NaN Everyone Action July 31, 2018 Varies with device 4.1 and up Good Positive NaN NaN
freq 1365 15393 NaN NaN NaN NaN 46965 NaN 36324 5311 4024 12599 13030 242 29916 NaN NaN
mean NaN NaN 4.350272 3.071857e+06 32.276746 8.765923e+07 NaN 0.043968 NaN NaN NaN NaN NaN NaN NaN 0.148750 0.497285
std NaN NaN 0.262259 7.725467e+06 26.737134 2.023989e+08 NaN 0.562501 NaN NaN NaN NaN NaN NaN NaN 0.327719 0.232981
min NaN NaN 2.700000 4.600000e+01 1.100000 1.000000e+03 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN -1.000000 0.000000
25% NaN NaN 4.200000 2.743900e+04 9.800000 1.000000e+06 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN -0.016667 0.390000
50% NaN NaN 4.400000 3.387420e+05 22.000000 1.000000e+07 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.121605 0.509524
75% NaN NaN 4.500000 2.440695e+06 52.000000 1.000000e+08 NaN 0.000000 NaN NaN NaN NaN NaN NaN NaN 0.350000 0.629048
max NaN NaN 4.900000 7.815831e+07 100.000000 1.000000e+09 NaN 9.990000 NaN NaN NaN NaN NaN NaN NaN 1.000000 1.000000
In [36]:
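# Assign the result back; an inplace replace on a selected column may not update df3 in newer pandas versions.
df3["Content Rating"]=df3["Content Rating"].replace("Everyone 10+","Everyone")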
In [37]:
px.box(df3[df3['Category'].isin(['GAME', 'FAMILY', 'HEALTH_AND_FITNESS', 'DATING', 'PRODUCTIVITY'])],
       y='Sentiment_Polarity',color='Type',hover_data=["App","Translated_Review"],
       facet_col='Content Rating',facet_row="Category",height=1000,
      category_orders={"Content Rating": ["Everyone", "Teen", "Mature 17+"]},
      title='The distribution of review sentiment polarity by content rating and app type for five app categories')
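The faceted box plot can be backed by a small summary table. The sketch below groups by content rating and type on hypothetical toy data and reports the mean, median and number of reviews per group.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "Content Rating": ["Everyone", "Everyone", "Teen", "Teen"],
    "Type": ["Free", "Paid", "Free", "Free"],
    "Sentiment_Polarity": [0.5, -0.25, 0.0, 1.0],
})

# Polarity statistics for each combination of content rating and app type.
summary = (
    toy.groupby(["Content Rating", "Type"])["Sentiment_Polarity"]
       .agg(["mean", "median", "count"])
)
print(summary)
```

The count column matters here: groups with very few reviews, such as paid apps in this dataset, give unstable averages.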